import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support, classification_report
from sklearn import metrics
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.graph_objects as go
from pandas.api.types import is_numeric_dtype
import seaborn as sb
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import optuna
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from tabulate import tabulate
from sklearn.metrics import accuracy_score, precision_score, recall_score
import time
Our dataset ships with separate Train and Test files, but we chose to use only the Train set, which we will split ourselves later on to make sure the split is truly random.
intrusion = pd.read_csv("network.csv")
intrusionog=intrusion.copy()
Preliminary Exploration
Description of the Numerical Features
intrusion.describe()
| duration | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | num_failed_logins | logged_in | num_compromised | ... | dst_host_count | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 25192.000000 | 2.519200e+04 | 2.519200e+04 | 25192.000000 | 25192.000000 | 25192.00000 | 25192.000000 | 25192.000000 | 25192.000000 | 25192.000000 | ... | 25192.000000 | 25192.000000 | 25192.000000 | 25192.000000 | 25192.000000 | 25192.000000 | 25192.000000 | 25192.000000 | 25192.000000 | 25192.000000 |
| mean | 305.054104 | 2.433063e+04 | 3.491847e+03 | 0.000079 | 0.023738 | 0.00004 | 0.198039 | 0.001191 | 0.394768 | 0.227850 | ... | 182.532074 | 115.063036 | 0.519791 | 0.082539 | 0.147453 | 0.031844 | 0.285800 | 0.279846 | 0.117800 | 0.118769 |
| std | 2686.555640 | 2.410805e+06 | 8.883072e+04 | 0.008910 | 0.260221 | 0.00630 | 2.154202 | 0.045418 | 0.488811 | 10.417352 | ... | 98.993895 | 110.646850 | 0.448944 | 0.187191 | 0.308367 | 0.110575 | 0.445316 | 0.446075 | 0.305869 | 0.317333 |
| min | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 84.000000 | 10.000000 | 0.050000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 4.400000e+01 | 0.000000e+00 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 255.000000 | 61.000000 | 0.510000 | 0.030000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.000000 | 2.790000e+02 | 5.302500e+02 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | ... | 255.000000 | 255.000000 | 1.000000 | 0.070000 | 0.060000 | 0.020000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
| max | 42862.000000 | 3.817091e+08 | 5.151385e+06 | 1.000000 | 3.000000 | 1.00000 | 77.000000 | 4.000000 | 1.000000 | 884.000000 | ... | 255.000000 | 255.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 38 columns
Description of the Categorical Features
intrusion.describe(include = 'object')
| protocol_type | service | flag | class | |
|---|---|---|---|---|
| count | 25192 | 25192 | 25192 | 25192 |
| unique | 3 | 66 | 11 | 2 |
| top | tcp | http | SF | normal |
| freq | 20526 | 8003 | 14973 | 13449 |
Storing the class column separately for later use
clss = pd.DataFrame(intrusion['class'])
A look at the data type of each variable
intrusion.dtypes
duration int64 protocol_type object service object flag object src_bytes int64 dst_bytes int64 land int64 wrong_fragment int64 urgent int64 hot int64 num_failed_logins int64 logged_in int64 num_compromised int64 root_shell int64 su_attempted int64 num_root int64 num_file_creations int64 num_shells int64 num_access_files int64 num_outbound_cmds int64 is_host_login int64 is_guest_login int64 count int64 srv_count int64 serror_rate float64 srv_serror_rate float64 rerror_rate float64 srv_rerror_rate float64 same_srv_rate float64 diff_srv_rate float64 srv_diff_host_rate float64 dst_host_count int64 dst_host_srv_count int64 dst_host_same_srv_rate float64 dst_host_diff_srv_rate float64 dst_host_same_src_port_rate float64 dst_host_srv_diff_host_rate float64 dst_host_serror_rate float64 dst_host_srv_serror_rate float64 dst_host_rerror_rate float64 dst_host_srv_rerror_rate float64 class object dtype: object
We check for missing values to ensure that the dataset is complete and accurate as it may lead to biased or inaccurate results and errors
intrusion.isnull().values.any()
False
No NaN values in the dataset.
Next, we check for duplicate rows as it may also create bias
print(f"Number of duplicate rows: {intrusion.duplicated().sum()}")
Number of duplicate rows: 0
Our dataset consists of 41 features (independent variables) and a class label (dependent variable). The 41 features can be grouped into:
- Basic (9): derived from the header information of network packets, e.g. duration, protocol_type, service.
- Content (13): derived from the packet payload, e.g. hot, num_failed_logins, logged_in.
- Traffic (9): capture the behaviour of connections between the same source and destination hosts, e.g. count, serror_rate, rerror_rate.
- Traffic between different hosts (10): capture the behaviour of connections between different source and destination hosts, e.g. dst_host_count, dst_host_srv_count.
1. Exploring the Categorical Features
=> To analyse the distribution of values within each feature and observe any patterns
categorical_features = ['protocol_type' , 'service', 'flag']
categorical_featureswclass = ['protocol_type' , 'service', 'flag','class']
for v in categorical_features:
print(f"=====Unique values of {v}=====")
unique_val = intrusion[v].unique()
print(unique_val)
print(f"Number of unique values: {len(unique_val)}\n")
=====Unique values of protocol_type===== ['tcp' 'udp' 'icmp'] Number of unique values: 3 =====Unique values of service===== ['ftp_data' 'other' 'private' 'http' 'remote_job' 'name' 'netbios_ns' 'eco_i' 'mtp' 'telnet' 'finger' 'domain_u' 'supdup' 'uucp_path' 'Z39_50' 'smtp' 'csnet_ns' 'uucp' 'netbios_dgm' 'urp_i' 'auth' 'domain' 'ftp' 'bgp' 'ldap' 'ecr_i' 'gopher' 'vmnet' 'systat' 'http_443' 'efs' 'whois' 'imap4' 'iso_tsap' 'echo' 'klogin' 'link' 'sunrpc' 'login' 'kshell' 'sql_net' 'time' 'hostnames' 'exec' 'ntp_u' 'discard' 'nntp' 'courier' 'ctf' 'ssh' 'daytime' 'shell' 'netstat' 'pop_3' 'nnsp' 'IRC' 'pop_2' 'printer' 'tim_i' 'pm_dump' 'red_i' 'netbios_ssn' 'rje' 'X11' 'urh_i' 'http_8001'] Number of unique values: 66 =====Unique values of flag===== ['SF' 'S0' 'REJ' 'RSTR' 'SH' 'RSTO' 'S1' 'RSTOS0' 'S3' 'S2' 'OTH'] Number of unique values: 11
for v in categorical_features:
fig = go.Figure()
fig.add_trace(go.Histogram(
x=intrusion[v],
nbinsx=len(intrusion[v].unique()),
histnorm='percent'
))
fig.update_layout(
title=f"Histogram of {v}",
xaxis_title=v,
yaxis_title='Percentage',
)
fig.show()
=> There are 3 categorical variables: protocol_type, service, and flag. We explored each variable's unique values, their frequencies, and the most frequently occurring value. We then plotted histograms to better visualise the distribution of each individual variable. We can observe that tcp, http and SF are the dominant values in their respective variables.
2. Numerical Features
=> To identify any strong linear relationships between pairs of numerical features which can help identify redundant features or multicollinearity, which might affect the performance of some machine learning models.
A. Correlation Analysis
plt.figure(figsize=(40,30))
sb.heatmap(intrusion.corr(), annot=True)
B. Histograms
Next, the univariate histograms of the features are plotted to give us preliminary understanding about the distributions.
num_features = [column for column in intrusion.columns if column not in categorical_featureswclass]
subplots_layout = dict(rows=10, cols=4, subplot_titles=num_features)
subplot_fig = make_subplots(**subplots_layout)
finished = False
for row_idx in range(1, 11):
if finished:
break
for col_idx in range(1, 5):
feature_position = 4 * (row_idx - 1) + (col_idx - 1)
if feature_position >= len(num_features):
finished = True
break
current_feature = num_features[feature_position]
feature_data = intrusion[current_feature]
histogram_trace = go.Histogram(x=feature_data, name=current_feature)
subplot_fig.add_trace(
histogram_trace,
row=row_idx,
col=col_idx
)
subplot_fig.update_layout(height=1200, title_text="Distribution of Numeric Features (Univariate)")
subplot_fig.show()
C. Statistical Dispersion and Variation
=> Based on the distributions above, some features behave almost like constant features. We will look at two measures that provide useful information about the dataset's characteristics:
- The proportion of the most frequent value in each feature
- The variance of each feature
n_samples = intrusion.shape[0] # Total number of samples
# Get the proportion of the value with the most count in each feature
max_proportions = pd.DataFrame()
for f in num_features:
feature_series = intrusion[f]
max_proportion = np.max(feature_series.value_counts()) / n_samples
max_proportions[f] = [max_proportion]
max_proportions.index = ["Max Proportion"]
# Get the variance of each feature
vars = pd.DataFrame(intrusion.var()).T
vars.index = ["Variance"]
disp_and_var = pd.concat([max_proportions, vars])  # DataFrame.append is deprecated in newer pandas
print("=====Statistical dispersion and variation=====")
display(disp_and_var)
=====Statistical dispersion and variation=====
| duration | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | num_failed_logins | logged_in | num_compromised | ... | dst_host_count | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Max Proportion | 9.196570e-01 | 3.916323e-01 | 5.388218e-01 | 0.999921 | 0.991108 | 0.99996 | 0.979359 | 0.999087 | 0.605232 | 0.989203 | ... | 0.589473 | 0.283741 | 0.387345 | 0.370872 | 0.503057 | 0.690179 | 0.643895 | 0.675016 | 0.821213 | 0.847452 |
| Variance | 7.217581e+06 | 5.811983e+12 | 7.890897e+09 | 0.000079 | 0.067715 | 0.00004 | 4.640585 | 0.002063 | 0.238936 | 108.521223 | ... | 9799.791278 | 12242.725493 | 0.201551 | 0.035041 | 0.095090 | 0.012227 | 0.198307 | 0.198983 | 0.093556 | 0.100701 |
2 rows × 38 columns
3. Bivariate Analysis
=> Before we begin the Bivariate Analysis, we decided to filter out the features with high max proportion or low variance based on pre-defined thresholds to simplify the analysis: a feature is kept only if its max proportion is below 0.99 and its variance is above 0.001.
=> We filter out these features to help reduce noise, improve accuracy and simplify the dataset by focusing on the most relevant features.
=> Next we plot scatter plots against the features in randomly picked pairs. This is to better visualise any potential linear relationships between any 2 features and quantify the strength of their relationship if needed.
# Filter out features with high "max proportion" or low "variance"
disp_and_var_T = disp_and_var.T # Take the transpose
features_remained = disp_and_var_T[(disp_and_var_T['Max Proportion'] < 0.99) & (disp_and_var_T['Variance'] > 0.001)].index.tolist()
intrusion = intrusion.loc[:, features_remained]
print(f"After filtering, {len(features_remained)} numeric features remain.")
# Plot bivariate distributions
features_picked = features_remained[-5:]
df_train = intrusion.loc[:, features_picked]
df_train['gt'] = clss
fig = px.scatter_matrix(df_train,
dimensions=features_picked,
color="gt",
symbol="gt")
fig.update_traces(diagonal_visible=False)
fig.update_layout(height=1200, title_text="Bivariate Distribution of Numeric Feature Pairs (Randomly Picked)")
fig.show()
After filtering, 25 numeric features remain.
4. Checking for Class Imbalance
=> We plot to see if there is an issue of class imbalance using the class column provided in the dataset.
class_count = pd.DataFrame(clss).value_counts().reset_index()
class_count.columns = ['Class', 'Count']
fig = px.bar(class_count,
x='Class',
y='Count',
color='Class',
text='Count',
labels={'Class': 'Class', 'Count': 'Count'},
title='Histogram of Classes')
fig.update_layout(xaxis_title='Class', yaxis_title='Count', showlegend=False)
fig.update_traces(textposition='outside')
fig.show()
We can observe that the classes 'normal' and 'anomaly' are fairly evenly distributed, so there is no severe issue of class imbalance. 'Anomaly' in this case refers to accesses classified as potentially dangerous.
1. Standardisation of the dataset, since some models (e.g. K-Nearest-Neighbours) can behave badly if the individual features do not look more or less like standard normally distributed data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# extract numerical attributes and scale it to have zero mean and unit variance
cols = intrusionog.select_dtypes(include=['float64','int64']).columns
sc_intrusion = scaler.fit_transform(intrusionog.select_dtypes(include=['float64','int64']))
# turn the result back to a dataframe
sc_intrusiondf = pd.DataFrame(sc_intrusion, columns = cols)
2. Encoding categorical attributes to make it compatible with numeric data when using models
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# extract categorical attributes
catintrusion = intrusionog.select_dtypes(include=['object']).copy()
# encode the categorical attributes
intrusioncat = catintrusion.apply(encoder.fit_transform)
# separate target column from encoded data
encintrusion = intrusioncat.drop(['class'], axis=1)
cat_intrusion = intrusioncat[['class']].copy()
intrusion_x = pd.concat([sc_intrusiondf,encintrusion],axis=1)
intrusion_y = intrusionog['class']
intrusion_x.shape
(25192, 41)
3. Performing Feature Importance
=> Using Random Forests and Recursive Feature Elimination to rank feature importance and identify the features most relevant to anomaly detection, reducing the complexity of the dataset.
=> Feature selection can also improve model performance by reducing the risk of overfitting.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier();
# fit random forest classifier on the training set
rfc.fit(intrusion_x, intrusion_y);
# extract important features
score = np.round(rfc.feature_importances_,3)
importances = pd.DataFrame({'feature':intrusion_x.columns,'importance':score})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
# plot importances
plt.rcParams['figure.figsize'] = (11, 4)
importances.plot.bar();
from sklearn.feature_selection import RFE
import itertools
rfc = RandomForestClassifier()
# create the RFE model and select 15 attributes
rfe = RFE(rfc, n_features_to_select=15)
rfe = rfe.fit(intrusion_x, intrusion_y)
# summarize the selection of the attributes
feature_map = [(i, v) for i, v in itertools.zip_longest(rfe.get_support(), intrusion_x.columns)]
selected_features = [v for i, v in feature_map if i]
selected_features
['src_bytes', 'dst_bytes', 'logged_in', 'count', 'srv_count', 'same_srv_rate', 'diff_srv_rate', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'protocol_type', 'service', 'flag']
=> The random forest algorithm builds an ensemble of decision trees, adding randomness while growing them: when splitting a node, it searches for the best feature within a random subset of features, which adds diversity and results in a better model.
=> RFE stands for Recursive Feature Elimination. It is effective at selecting the features (columns) of a training dataset that are most relevant in predicting the target variable.
=> src_bytes is the most important feature
We focused on the top 15 features selected by the random forest classifier.
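To make the RFE mechanics described above concrete, here is a minimal sketch of the loop sklearn runs internally (repeatedly fit, drop the least important feature, refit), on synthetic data rather than our intrusion dataset:

```python
# Hedged sketch: the RFE loop in miniature, on synthetic data (not the
# intrusion dataset); mirrors what sklearn's RFE does internally.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, random_state=0)
remaining = list(range(X_demo.shape[1]))

# Repeatedly fit and drop the least important feature until 3 remain
while len(remaining) > 3:
    model = RandomForestClassifier(random_state=0).fit(X_demo[:, remaining], y_demo)
    weakest = remaining[int(np.argmin(model.feature_importances_))]
    remaining.remove(weakest)

print(sorted(remaining))  # indices of the 3 surviving features
```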
4. Train Test Split
X_train,X_test,Y_train,Y_test = train_test_split(intrusion_x[selected_features],intrusion_y,train_size=0.80, random_state=2)
X_train.shape
(20153, 15)
X_test.shape
(5039, 15)
Y_train.shape
(20153,)
Y_test.shape
(5039,)
1. Decision tree (Baseline Model)
=> We chose Decision Tree as our baseline because of its versatility: it doesn't require many assumptions to be fulfilled in order to work efficiently. It is also easily interpretable.
dt = DecisionTreeClassifier(max_depth=4) #take an arbitrary value first
# Measure the time taken to fit the model
start_time = time.time()
dt.fit(X_train, Y_train)
fit_timedtog = time.time() - start_time
# Measure the accuracy of the model on the training and test set
dt_trainog, dt_testog = dt.score(X_train, Y_train), dt.score(X_test, Y_test)
# Measure the time taken to generate predictions
start_time = time.time()
y_pred = dt.predict(X_test)
predict_timedtog = time.time() - start_time
# Calculate the precision score
precisiondtog = precision_score(Y_test, y_pred, average='macro')
# Print the accuracy and time taken for fitting and predicting
print(f"Train Score: {dt_trainog}")
print(f"Test Score: {dt_testog}")
print(f"Precision: {precisiondtog}")
print(f"Time taken to fit the model: {fit_timedtog:.10f} seconds")
print(f"Time taken to generate predictions: {predict_timedtog:.10f} seconds")
Train Score: 0.9770257529896293
Test Score: 0.9740027783290336
Precision: 0.9735761882932672
Time taken to fit the model: 0.0400362015 seconds
Time taken to generate predictions: 0.0010008812 seconds
fig = plt.figure(figsize = (30,12))
tree.plot_tree(dt, filled=True);
plt.show()
=> The depth of 4 was arbitrary, drawn at random from the integers 1 to 10.
=> The results were fairly accurate, but we aimed to improve them by using Optuna to optimise the decision tree over the parameters max_depth and max_features.
2. Decision tree (Optimised)
def objective_DT(trial):
max_depth = trial.suggest_int('max_depth', 2, 32, log=False)
max_features = trial.suggest_int('max_features', 2, 10, log=False)
model = DecisionTreeClassifier(max_features=max_features, max_depth=max_depth)
model.fit(X_train, Y_train)
accuracy = model.score(X_test, Y_test)
return accuracy
Hyperparameter Tuning:
For the two hyperparameters, max_depth and max_features, the trial.suggest_int function suggests integer values within a specified range. max_depth bounds the depth of the decision tree, while max_features limits the number of features considered when splitting a node.
Using the suggested hyperparameters, a new DecisionTreeClassifier is constructed and fitted to the training data. Its score method is then applied to the test data to measure the classifier's accuracy.
The accuracy is returned as the objective value: the aim of the hyperparameter tuning process is to find the set of hyperparameters that maximises the Decision Tree classifier's accuracy on the test data.
study_dt = optuna.create_study(direction='maximize')
# Measure the time taken to run the Optuna study
start_time = time.time()
study_dt.optimize(objective_DT, n_trials=30)
optuna_timedt = time.time() - start_time
# Print the best trial from the Optuna study and the time taken to run the study
print(f"Best trial: {study_dt.best_trial}")
print(f"Time taken to run Optuna study: {optuna_timedt:.2f} seconds")
[I 2023-04-23 20:44:30,113] A new study created in memory with name: no-name-94d8c896-fe05-41b9-96e3-300db37d37ef [I 2023-04-23 20:44:30,153] Trial 0 finished with value: 0.9416550902956936 and parameters: {'max_depth': 2, 'max_features': 8}. Best is trial 0 with value: 0.9416550902956936. [I 2023-04-23 20:44:30,185] Trial 1 finished with value: 0.9924588211946815 and parameters: {'max_depth': 18, 'max_features': 2}. Best is trial 1 with value: 0.9924588211946815. [I 2023-04-23 20:44:30,227] Trial 2 finished with value: 0.9916650128993848 and parameters: {'max_depth': 12, 'max_features': 5}. Best is trial 1 with value: 0.9924588211946815. [I 2023-04-23 20:44:30,269] Trial 3 finished with value: 0.986108354832308 and parameters: {'max_depth': 7, 'max_features': 6}. Best is trial 1 with value: 0.9924588211946815. [I 2023-04-23 20:44:30,319] Trial 4 finished with value: 0.9956340543758683 and parameters: {'max_depth': 23, 'max_features': 8}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:30,351] Trial 5 finished with value: 0.9910696566779122 and parameters: {'max_depth': 28, 'max_features': 2}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:30,393] Trial 6 finished with value: 0.9853145465370113 and parameters: {'max_depth': 7, 'max_features': 7}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:30,452] Trial 7 finished with value: 0.9924588211946815 and parameters: {'max_depth': 18, 'max_features': 10}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:30,490] Trial 8 finished with value: 0.991863464973209 and parameters: {'max_depth': 31, 'max_features': 4}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:30,522] Trial 9 finished with value: 0.9892835880134947 and parameters: {'max_depth': 14, 'max_features': 2}. Best is trial 4 with value: 0.9956340543758683. 
[I 2023-04-23 20:44:30,588] Trial 10 finished with value: 0.9950386981543957 and parameters: {'max_depth': 25, 'max_features': 10}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:30,647] Trial 11 finished with value: 0.9944433419329232 and parameters: {'max_depth': 24, 'max_features': 10}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:30,711] Trial 12 finished with value: 0.9944433419329232 and parameters: {'max_depth': 24, 'max_features': 9}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:30,770] Trial 13 finished with value: 0.9944433419329232 and parameters: {'max_depth': 23, 'max_features': 8}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:30,835] Trial 14 finished with value: 0.9938479857114507 and parameters: {'max_depth': 28, 'max_features': 9}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:30,889] Trial 15 finished with value: 0.9928557253423298 and parameters: {'max_depth': 20, 'max_features': 8}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:30,952] Trial 16 finished with value: 0.9916650128993848 and parameters: {'max_depth': 32, 'max_features': 10}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:31,005] Trial 17 finished with value: 0.9924588211946815 and parameters: {'max_depth': 27, 'max_features': 7}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:31,067] Trial 18 finished with value: 0.9938479857114507 and parameters: {'max_depth': 20, 'max_features': 9}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:31,124] Trial 19 finished with value: 0.9924588211946815 and parameters: {'max_depth': 14, 'max_features': 7}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:31,170] Trial 20 finished with value: 0.993054177416154 and parameters: {'max_depth': 22, 'max_features': 4}. Best is trial 4 with value: 0.9956340543758683. 
[I 2023-04-23 20:44:31,234] Trial 21 finished with value: 0.9944433419329232 and parameters: {'max_depth': 25, 'max_features': 10}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:31,302] Trial 22 finished with value: 0.9946417940067473 and parameters: {'max_depth': 26, 'max_features': 9}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:31,366] Trial 23 finished with value: 0.9944433419329232 and parameters: {'max_depth': 29, 'max_features': 9}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:31,419] Trial 24 finished with value: 0.9948402460805715 and parameters: {'max_depth': 26, 'max_features': 8}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:31,470] Trial 25 finished with value: 0.9924588211946815 and parameters: {'max_depth': 21, 'max_features': 6}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:31,532] Trial 26 finished with value: 0.9940464377852749 and parameters: {'max_depth': 29, 'max_features': 8}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:31,587] Trial 27 finished with value: 0.9938479857114507 and parameters: {'max_depth': 16, 'max_features': 7}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:31,636] Trial 28 finished with value: 0.9924588211946815 and parameters: {'max_depth': 26, 'max_features': 5}. Best is trial 4 with value: 0.9956340543758683. [I 2023-04-23 20:44:31,695] Trial 29 finished with value: 0.9924588211946815 and parameters: {'max_depth': 22, 'max_features': 8}. Best is trial 4 with value: 0.9956340543758683.
Best trial: FrozenTrial(number=4, state=TrialState.COMPLETE, values=[0.9956340543758683], datetime_start=datetime.datetime(2023, 4, 23, 20, 44, 30, 269652), datetime_complete=datetime.datetime(2023, 4, 23, 20, 44, 30, 319698), params={'max_depth': 23, 'max_features': 8}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'max_depth': IntDistribution(high=32, log=False, low=2, step=1), 'max_features': IntDistribution(high=10, log=False, low=2, step=1)}, trial_id=4, value=None)
Time taken to run Optuna study: 1.58 seconds
Using Optuna Library
The goal of this code is to use the Optuna library to perform hyperparameter tuning for a Decision Tree classifier and print out the optimal set of hyperparameters.
The direction argument of the optuna.create_study(direction='maximize') function generates a new study object for hyperparameter optimization, indicating that we wish to maximize the objective function (in this example, the accuracy).
Using the objective function objective_DT and a total of 30 trials, study_dt.optimize(objective_DT, n_trials=30) optimises the Decision Tree classifier's hyperparameters. Optuna proposes a fresh set of hyperparameters for each trial and assesses how well they perform using the objective function. Based on the result of each trial, Optuna updates its internal model to guide the search towards promising regions of the hyperparameter space.
dt = DecisionTreeClassifier(max_features = study_dt.best_trial.params['max_features'], max_depth = study_dt.best_trial.params['max_depth'])
# Measure the time taken to fit the model
start_time = time.time()
dt.fit(X_train, Y_train)
fit_timedt = time.time() - start_time
# Measure the accuracy of the model on the training and test set
dt_train, dt_test = dt.score(X_train, Y_train), dt.score(X_test, Y_test)
# Measure the time taken to generate predictions
start_time = time.time()
y_pred = dt.predict(X_test)
predict_timedt = time.time() - start_time
# Calculate the precision score
precisiondt = precision_score(Y_test, y_pred, average='macro')
# Print the accuracy and time taken for fitting and predicting
print(f"Train Score: {dt_train}")
print(f"Test Score: {dt_test}")
print(f"Precision: {precisiondt}")
print(f"Time taken to fit the model: {fit_timedt:.10f} seconds")
print(f"Time taken to generate predictions: {predict_timedt:.10f} seconds")
Train Score: 1.0
Test Score: 0.9928557253423298
Precision: 0.9927331243171346
Time taken to fit the model: 0.0370340347 seconds
Time taken to generate predictions: 0.0010011196 seconds
The average parameter must be set when precision_score is used for multi-class classification.
We use 'macro' because it treats the two classes equally and reduces bias between them.
High precision means the model generates fewer false alarms, which is equally crucial for an intrusion detection system.
Performance is improved after optimisation.
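The effect of macro averaging can be seen on a toy example; a minimal sketch (the labels below are made up for illustration, not taken from our dataset):

```python
# Hedged sketch: macro averaging treats both classes equally, regardless
# of how many samples each class has.
from sklearn.metrics import precision_score

y_true = ['normal', 'normal', 'normal', 'anomaly', 'anomaly', 'normal']
y_pred = ['normal', 'normal', 'anomaly', 'anomaly', 'normal', 'normal']

# Per-class precision: anomaly = 1/2, normal = 3/4
macro = precision_score(y_true, y_pred, average='macro')  # (0.5 + 0.75) / 2 = 0.625
micro = precision_score(y_true, y_pred, average='micro')  # 4 correct of 6 predictions
print(macro, micro)
```

Micro averaging would instead weight each sample equally, letting the majority class dominate the score.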
fig = plt.figure(figsize = (30,12))
tree.plot_tree(dt, filled=True);
plt.show()
Now we carry on with the other machine learning models, WITH OPTIMISATION.
2. K Nearest Neighbours
=> KNN is a non-parametric algorithm: it makes no assumptions about the underlying distribution of the data, relying instead on the distances between data points to make predictions. It is a supervised machine learning algorithm that can solve both classification and regression problems using feature similarity, which makes it popular for anomaly detection.
=> We also used Optuna to optimize it based on the parameter of K, which is the number of nearest neighbors.
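The distance-based majority vote described above can be sketched on a tiny hand-made dataset (the points and labels are invented for illustration):

```python
# Hedged sketch: KNN predicts by majority vote among the k closest points,
# shown on a made-up 2-D dataset with two well-separated clusters.
from sklearn.neighbors import KNeighborsClassifier

X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = ['normal', 'normal', 'normal', 'anomaly', 'anomaly', 'anomaly']

knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Each query point's 3 nearest neighbours all share one label
preds = knn.predict([[0.5, 0.5], [5.5, 5.5]])
print(preds)  # classified as normal and anomaly respectively
```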
def objective_KNN(trial):
n_neighbors = trial.suggest_int('KNN_n_neighbors', 2, 16, log=False)
classifier_obj = KNeighborsClassifier(n_neighbors=n_neighbors)
classifier_obj.fit(X_train, Y_train)
accuracy = classifier_obj.score(X_test, Y_test)
return accuracy
study_KNN = optuna.create_study(direction='maximize')
# Measure the time taken to run the Optuna study
start_time = time.time()
study_KNN.optimize(objective_KNN, n_trials=1)
optuna_timeknn = time.time() - start_time
# Print the best trial from the Optuna study and the time taken to run the study
print(f"Best trial: {study_KNN.best_trial}")
print(f"Time taken to run Optuna study: {optuna_timeknn:.2f} seconds")
[I 2023-04-23 20:45:13,465] A new study created in memory with name: no-name-7b4de6c0-5c01-4ff4-a697-006789fb227d [I 2023-04-23 20:45:13,900] Trial 0 finished with value: 0.9876959714229013 and parameters: {'KNN_n_neighbors': 2}. Best is trial 0 with value: 0.9876959714229013.
Best trial: FrozenTrial(number=0, state=TrialState.COMPLETE, values=[0.9876959714229013], datetime_start=datetime.datetime(2023, 4, 23, 20, 45, 13, 466677), datetime_complete=datetime.datetime(2023, 4, 23, 20, 45, 13, 900070), params={'KNN_n_neighbors': 2}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'KNN_n_neighbors': IntDistribution(high=16, log=False, low=2, step=1)}, trial_id=0, value=None)
Time taken to run Optuna study: 0.43 seconds
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score
import time
# Initialize the KNN classifier
KNN_model = KNeighborsClassifier(n_neighbors=study_KNN.best_trial.params['KNN_n_neighbors'])
# Measure the time taken to fit the model
start_time = time.time()
KNN_model.fit(X_train, Y_train)
fit_timeknn = time.time() - start_time
# Measure the accuracy of the model on the training and test set
KNN_train, KNN_test = KNN_model.score(X_train, Y_train), KNN_model.score(X_test, Y_test)
# Measure the time taken to generate predictions
start_time = time.time()
y_pred = KNN_model.predict(X_test)
predict_timeknn = time.time() - start_time
# Calculate the precision score
precisionKnn = precision_score(Y_test, y_pred, average='macro')
# Print the precision score and time taken for fitting and predicting
print(f"Precision: {precisionKnn}")
print(f"Train Score: {KNN_train}")
print(f"Test Score: {KNN_test}")
print(f"Time taken to fit the model: {fit_timeknn:.10f} seconds")
print(f"Time taken to generate predictions: {predict_timeknn:.10f} seconds")
Precision: 0.9871910929691968
Train Score: 0.9961792288989232
Test Score: 0.9876959714229013
Time taken to fit the model: 0.0890808105 seconds
Time taken to generate predictions: 0.3292989731 seconds
3. Logistic Regression
=> Logistic regression is a fast and accurate algorithm that models the probability of a discrete outcome, which fits our scenario: we are modelling a binary outcome of Normal or Anomalous.
=> We optimised it with an L2 penalty, which adds a term that encourages smaller values for the regression coefficients, shrinking them towards zero to prevent overfitting.
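The shrinking effect of the L2 penalty can be seen directly through the C parameter (the inverse regularisation strength that Optuna tunes below): smaller C means a stronger penalty and smaller coefficients. This is a minimal sketch on synthetic data, separate from our pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only, not our dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Smaller C => stronger L2 penalty => coefficients shrink towards zero
norms = []
for C in [1e-3, 1e0, 1e3]:
    m = LogisticRegression(C=C, penalty='l2', max_iter=1000).fit(X_demo, y_demo)
    norms.append(np.linalg.norm(m.coef_))

# The L2 norm of the coefficient vector grows as the penalty weakens
assert norms[0] < norms[1] < norms[2]
```

This is why Optuna searches C on a log scale: the interesting behaviour spans several orders of magnitude.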
def objective_lg(trial):
    # Define the hyperparameters to optimise
    # (suggest_loguniform is deprecated in newer Optuna; the replacement is
    # trial.suggest_float('C', 1e-5, 1e5, log=True))
    C = trial.suggest_loguniform('C', 1e-5, 1e5)
    penalty = trial.suggest_categorical('penalty', ['l2'])
    # Train a logistic regression model with the chosen hyperparameters
    lg_model = LogisticRegression(C=C, penalty=penalty)
    lg_model.fit(X_train, Y_train)
    # Evaluate the model on the test set and return the accuracy score as the objective value
    Y_pred = lg_model.predict(X_test)
    accuracy = accuracy_score(Y_test, Y_pred)
    return accuracy
study_lg = optuna.create_study(direction='maximize')
start_time = time.time()
study_lg.optimize(objective_lg, n_trials=30)
optuna_timelg = time.time() - start_time
# Print the best trial from the Optuna study
print(study_lg.best_trial)
print(f"Time taken to run Optuna study: {optuna_timelg:.2f} seconds")
[I 2023-04-23 20:45:16,142] A new study created in memory with name: no-name-94e002d3-ec75-4999-834c-6d59d8465ca4
[I 2023-04-23 20:45:16,281] Trial 0 finished with value: 0.9382814050406827 and parameters: {'C': 8780.611626370408, 'penalty': 'l2'}. Best is trial 0 with value: 0.9382814050406827.
[I 2023-04-23 20:45:16,418] Trial 1 finished with value: 0.9390752133359793 and parameters: {'C': 0.03324545070797805, 'penalty': 'l2'}. Best is trial 1 with value: 0.9390752133359793.
[I 2023-04-23 20:45:16,559] Trial 2 finished with value: 0.9390752133359793 and parameters: {'C': 0.03730722705679228, 'penalty': 'l2'}. Best is trial 1 with value: 0.9390752133359793.
[I 2023-04-23 20:45:16,627] Trial 3 finished with value: 0.8880730303631673 and parameters: {'C': 0.0002012819933649707, 'penalty': 'l2'}. Best is trial 1 with value: 0.9390752133359793.
[I 2023-04-23 20:45:16,690] Trial 4 finished with value: 0.8815241119269697 and parameters: {'C': 0.00017480789551259737, 'penalty': 'l2'}. Best is trial 1 with value: 0.9390752133359793.
[I 2023-04-23 20:45:16,826] Trial 5 finished with value: 0.9376860488192101 and parameters: {'C': 2254.2699560358037, 'penalty': 'l2'}. Best is trial 1 with value: 0.9390752133359793.
[I 2023-04-23 20:45:16,963] Trial 6 finished with value: 0.937487596745386 and parameters: {'C': 28.173389515071356, 'penalty': 'l2'}. Best is trial 1 with value: 0.9390752133359793.
[I 2023-04-23 20:45:17,097] Trial 7 finished with value: 0.937487596745386 and parameters: {'C': 4.762822342955754, 'penalty': 'l2'}. Best is trial 1 with value: 0.9390752133359793.
[I 2023-04-23 20:45:17,233] Trial 8 finished with value: 0.9370906925977377 and parameters: {'C': 26995.537064499757, 'penalty': 'l2'}. Best is trial 1 with value: 0.9390752133359793.
[I 2023-04-23 20:45:17,371] Trial 9 finished with value: 0.9386783091883311 and parameters: {'C': 21643.76549376731, 'penalty': 'l2'}. Best is trial 1 with value: 0.9390752133359793.
[I 2023-04-23 20:45:17,519] Trial 10 finished with value: 0.936296884302441 and parameters: {'C': 0.008959957617886588, 'penalty': 'l2'}. Best is trial 1 with value: 0.9390752133359793.
[I 2023-04-23 20:45:17,664] Trial 11 finished with value: 0.9388767612621551 and parameters: {'C': 0.05940898701624244, 'penalty': 'l2'}. Best is trial 1 with value: 0.9390752133359793.
[I 2023-04-23 20:45:17,805] Trial 12 finished with value: 0.9396705695574519 and parameters: {'C': 0.061558807112055705, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:17,945] Trial 13 finished with value: 0.9376860488192101 and parameters: {'C': 0.32253349486743915, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:18,084] Trial 14 finished with value: 0.9239928557253423 and parameters: {'C': 0.00286857412458809, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:18,141] Trial 15 finished with value: 0.870410795792816 and parameters: {'C': 2.241305724971338e-05, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:18,285] Trial 16 finished with value: 0.937487596745386 and parameters: {'C': 0.7634018627588752, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:18,429] Trial 17 finished with value: 0.9372891446715618 and parameters: {'C': 42.994936194977036, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:18,574] Trial 18 finished with value: 0.9289541575709466 and parameters: {'C': 0.004429016810856599, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:18,722] Trial 19 finished with value: 0.9384798571145069 and parameters: {'C': 0.1288290519982272, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:18,860] Trial 20 finished with value: 0.9368922405239135 and parameters: {'C': 1.7855707517446082, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:19,005] Trial 21 finished with value: 0.9388767612621551 and parameters: {'C': 0.03688953338408776, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:19,147] Trial 22 finished with value: 0.9386783091883311 and parameters: {'C': 0.023351739420480355, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:19,288] Trial 23 finished with value: 0.9384798571145069 and parameters: {'C': 0.14523410989014862, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:19,437] Trial 24 finished with value: 0.9222067870609247 and parameters: {'C': 0.0024947770898908605, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:19,579] Trial 25 finished with value: 0.9376860488192101 and parameters: {'C': 0.44722319257392157, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:19,718] Trial 26 finished with value: 0.9378845008930343 and parameters: {'C': 0.017868523737748908, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:19,863] Trial 27 finished with value: 0.937487596745386 and parameters: {'C': 4.8456933756212495, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:19,994] Trial 28 finished with value: 0.9087120460408811 and parameters: {'C': 0.0008256934493124494, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
[I 2023-04-23 20:45:20,138] Trial 29 finished with value: 0.9388767612621551 and parameters: {'C': 0.06416284796890029, 'penalty': 'l2'}. Best is trial 12 with value: 0.9396705695574519.
FrozenTrial(number=12, state=TrialState.COMPLETE, values=[0.9396705695574519], datetime_start=datetime.datetime(2023, 4, 23, 20, 45, 17, 665406), datetime_complete=datetime.datetime(2023, 4, 23, 20, 45, 17, 805533), params={'C': 0.061558807112055705, 'penalty': 'l2'}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'C': FloatDistribution(high=100000.0, log=True, low=1e-05, step=None), 'penalty': CategoricalDistribution(choices=('l2',))}, trial_id=12, value=None)
Time taken to run Optuna study: 4.00 seconds
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
import time
# Initialize the logistic regression classifier
lg_model = LogisticRegression(C=study_lg.best_trial.params['C'], penalty=study_lg.best_trial.params['penalty'])
# Measure the time taken to fit the model
start_time = time.time()
lg_model.fit(X_train, Y_train)
fit_timelg = time.time() - start_time
# Measure the accuracy of the model on the training and test set
lg_train, lg_test = lg_model.score(X_train, Y_train), lg_model.score(X_test, Y_test)
# Measure the time taken to generate predictions
start_time = time.time()
y_pred = lg_model.predict(X_test)
predict_timelg = time.time() - start_time
# Calculate the precision score
precisionlg = precision_score(Y_test, y_pred, average='macro')
# Print the precision score and time taken for fitting and predicting
print(f"Precision: {precisionlg}")
print(f"Training Score: {lg_train}")
print(f"Test Score: {lg_test}")
print(f"Time taken to fit the model: {fit_timelg:.10f} seconds")
print(f"Time taken to generate predictions: {predict_timelg:.10f} seconds")
Precision: 0.9401861068895039
Training Score: 0.9440281843894209
Test Score: 0.9396705695574519
Time taken to fit the model: 0.1281154156 seconds
Time taken to generate predictions: 0.0010001659 seconds
=> Metrics Chosen:
1. Precision
2. Time to Fit
3. Time to Predict
=> Why we chose them:
- Time taken to classify network accesses is crucial: the system should detect an attack almost instantly to prevent simultaneous attacks.
- Accuracy (our Train and Test Scores) measures how often the model makes correct predictions.
- Precision measures how many of the model's positive predictions are actually correct.
We need to prevent as many false positives as possible: they are costly in time and resources, triggering unnecessary alerts and investigations that cause distractions and delays. This can open holes in the system and possibly lead to more attacks.
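As a minimal illustration of why precision captures the false-positive cost (toy labels, not our dataset), precision for the positive class is TP / (TP + FP), which is exactly what `precision_score` computes:

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy labels (illustrative only): 1 = anomalous, 0 = normal
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

# Precision for the positive class = TP / (TP + FP)
tp = np.sum((y_pred == 1) & (y_true == 1))  # 3 true alarms
fp = np.sum((y_pred == 1) & (y_true == 0))  # 1 false alarm
manual = tp / (tp + fp)                     # 0.75

assert manual == precision_score(y_true, y_pred)
```

In our comparison below we use `average='macro'`, which computes this per class and averages, so false alarms on either class count against the score.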
data = [["KNN", KNN_train, KNN_test,precisionKnn,fit_timeknn,predict_timeknn,optuna_timeknn],
["Logistic Regression", lg_train, lg_test,precisionlg,fit_timelg,predict_timelg,optuna_timelg],
["Decision Tree", dt_train, dt_test,precisiondt,fit_timedt,predict_timedt,optuna_timedt]]
col_names = ["Model", "Train Score", "Test Score","Precision","Time to Fit","Time to Predict","Optimisation time"]
print(tabulate(data, headers=col_names, tablefmt="fancy_grid"))
╒═════════════════════╤═══════════════╤══════════════╤═════════════╤═══════════════╤═══════════════════╤═════════════════════╕
│ Model               │   Train Score │   Test Score │   Precision │   Time to Fit │   Time to Predict │   Optimisation time │
╞═════════════════════╪═══════════════╪══════════════╪═════════════╪═══════════════╪═══════════════════╪═════════════════════╡
│ KNN                 │      0.996179 │     0.987696 │    0.987191 │     0.0890808 │        0.329299   │            0.434393 │
├─────────────────────┼───────────────┼──────────────┼─────────────┼───────────────┼───────────────────┼─────────────────────┤
│ Logistic Regression │      0.944028 │     0.939671 │    0.940186 │     0.128115  │        0.00100017 │            3.99606  │
├─────────────────────┼───────────────┼──────────────┼─────────────┼───────────────┼───────────────────┼─────────────────────┤
│ Decision Tree       │      1        │     0.992856 │    0.992733 │     0.037034  │        0.00100112 │            1.58179  │
╘═════════════════════╧═══════════════╧══════════════╧═════════════╧═══════════════╧═══════════════════╧═════════════════════╛
Findings
=> Most accurate:
Decision Tree:
- Precision: 0.992733
- Time To Fit: 0.037034
- Time To Predict: 0.00100112
- Optimisation Time: 1.58179
Analysis
=> For the models tuned with the hyperparameters found by Optuna, the results above show that the most accurate and precise model is the optimised decision tree, i.e. our baseline model after optimisation.
=> However, we also need to take the optimisation time into account, which is more than three times that of KNN, since the total time matters when intrusion detection runs in real time.
Improving our Models
=> To improve speed and accuracy, we suggest parallel processing and gradient boosting models such as XGBoost or scikit-learn's GradientBoostingClassifier. Gradient boosting improves accuracy by iteratively combining the predictions of weak learners, while parallel processing splits the data and runs the algorithms on multiple processors, which greatly speeds up the models.
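As a sketch of the gradient boosting direction (on synthetic stand-in data, not our intrusion dataset), scikit-learn's GradientBoostingClassifier fits shallow trees sequentially, each one correcting the errors of the ensemble built so far:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data as a stand-in for the intrusion data (illustrative only)
Xd, yd = make_classification(n_samples=2000, n_features=20, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(Xd, yd, test_size=0.3, random_state=42)

# 100 shallow trees boosted sequentially; each new tree fits the
# residual errors of the current ensemble
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
gb.fit(Xtr, ytr)
gb_test = gb.score(Xte, yte)
print(f"Gradient boosting test score: {gb_test:.3f}")
```

For the parallel-processing side, ensemble methods whose trees are independent (e.g. RandomForestClassifier) expose `n_jobs=-1` to use all available cores; boosting itself is sequential, though libraries like XGBoost parallelise the per-tree split search.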
Our Conclusion
=> All models perform similarly, each with its own strengths, as mentioned in our analysis.
=> We believe the next best step for companies would be to use multiple models together and leverage their strengths to improve speed and accuracy, which are key for a detection system.